Random Search

python
datacamp
machine learning
deep learning
hyperparameter
random search
Author

kakamana

Published

April 9, 2023

Random Search in Scikit Learn

  • Compared to GridSearchCV, the setup steps are the same:
    • Decide on an algorithm/estimator
    • Define which hyperparameters to tune
    • Define a range of values for each hyperparameter
    • Set a cross-validation scheme
    • Define a score function
    • Include extra useful information or functions
  • In random search, there is one additional step:
    • Decide how many samples to take, then draw them
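To make the difference concrete, here is a minimal sketch in plain Python (not Scikit Learn's own implementation) that builds a full grid and then draws a fixed number of combinations from it. The ranges match the exercise in the next section; the seed and the variable names are assumptions chosen only for illustration.

```python
import itertools
import random

# Same ranges as the exercise: 150 evenly spaced learning rates in [0.1, 2]
# and every integer min_samples_leaf from 20 to 64 inclusive.
learning_rates = [0.1 + i * (2 - 0.1) / 149 for i in range(150)]
min_samples_leaves = list(range(20, 65))

# A grid search would evaluate every combination.
full_grid = list(itertools.product(learning_rates, min_samples_leaves))
print(len(full_grid))  # 150 * 45 = 6750 combinations

# A random search only evaluates a fixed number of them.
rng = random.Random(0)  # arbitrary seed, only for repeatability
sampled = rng.sample(full_grid, k=10)
for lr, leaf in sampled:
    print(round(lr, 3), leaf)
```

With 10 samples, only a small fraction of the 6750 possible combinations is ever fitted, which is where the time saving comes from.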

The RandomizedSearchCV Object

Just like GridSearchCV, Scikit Learn’s RandomizedSearchCV class provides many useful features for running a random search efficiently. You’re going to create a RandomizedSearchCV object, making the small adjustments needed from the GridSearchCV object: the grid is passed as param_distributions, and n_iter sets how many combinations to sample.

The desired options are:

  • A default Gradient Boosting Classifier Estimator
  • 5-fold cross validation
  • Use accuracy to score the models
  • Use 4 cores for processing in parallel
  • Ensure you refit the best model and return training scores
  • Randomly sample 10 models

The hyperparameter grid should be for learning_rate (150 values between 0.1 and 2) and min_samples_leaf (all values between and including 20 and 64).
Code
import pandas as pd
from sklearn.model_selection import train_test_split

credit_card = pd.read_csv('dataset/credit-card-full.csv')
# Replace the categorical variables with dummy variables
credit_card = pd.get_dummies(credit_card, columns=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)

X = credit_card.drop(['ID', 'default payment next month'], axis=1)
y = credit_card['default payment next month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
Code
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid
param_grid = {'learning_rate': np.linspace(0.1, 2, 150),
              'min_samples_leaf': list(range(20, 65))}

# Create a random search object
random_GBM_class = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    scoring='accuracy', n_jobs=4, cv=5, refit=True, return_train_score=True
)

# Fit to the training data
random_GBM_class.fit(X_train, y_train)

# Print the values used for both hyperparameters
print(random_GBM_class.cv_results_['param_learning_rate'])
print(random_GBM_class.cv_results_['param_min_samples_leaf'])
[0.9416107382550335 0.7375838926174496 1.719463087248322
 0.7885906040268456 1.8469798657718122 1.6684563758389261
 0.3932885906040269 0.22751677852348992 0.7248322147651006
 0.11275167785234899]
[20 46 37 47 28 26 51 35 57 48]
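Beyond printing the sampled values, it is often useful to see how each sampled model scored. The sketch below shows one way to do that by loading cv_results_ into a DataFrame and reading best_params_ and best_score_. It runs on a small synthetic dataset from make_classification rather than the credit card data, and n_estimators=20, cv=3, n_iter=5 and the random_state values are assumptions chosen only to keep the example quick and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the credit card data (assumption, illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'learning_rate': np.linspace(0.1, 2, 150),
              'min_samples_leaf': list(range(20, 65))}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions=param_grid,
    n_iter=5, scoring='accuracy', cv=3,
    refit=True, return_train_score=True, random_state=0
)
search.fit(X_demo, y_demo)

# One row per sampled model, one param_* column per hyperparameter
results = pd.DataFrame(search.cv_results_)
cols = ['param_learning_rate', 'param_min_samples_leaf',
        'mean_train_score', 'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# Available because refit=True
print(search.best_params_)
print(search.best_score_)
```

The mean_train_score column is only present because return_train_score=True; comparing it with mean_test_score gives a rough sense of how much each sampled model overfits.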

RandomizedSearchCV in Scikit Learn

Let’s practice building a RandomizedSearchCV object using Scikit Learn.

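The hyperparameter grid should be for max_depth (all values between and including 5 and 25) and max_features (‘auto’ and ‘sqrt’). Note that ‘auto’ has been deprecated since scikit-learn 1.1 and is scheduled for removal in 1.3; for RandomForestClassifier it behaves the same as ‘sqrt’.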

The desired options for the RandomizedSearchCV object are:

  • A RandomForestClassifier Estimator with n_estimators of 80.
  • 3-fold cross validation (cv)
  • Use roc_auc to score the models
  • Use 4 cores for processing in parallel (n_jobs)
  • Ensure you refit the best model and return training scores
  • Only sample 5 models for efficiency (n_iter)

Remember that the chosen hyperparameters are found in cv_results_, with one column per hyperparameter. For example, the column for the hyperparameter criterion would be param_criterion.
Code
from sklearn.ensemble import RandomForestClassifier

# Create the parameter grid
# Note: 'auto' is deprecated since scikit-learn 1.1 (removed in 1.3); drop it on newer versions
param_grid = {'max_depth': list(range(5, 26)), 'max_features': ['auto', 'sqrt']}

# Create a random search object
random_rf_class = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=80),
    param_distributions=param_grid, n_iter=5,
    scoring='roc_auc', n_jobs=4, cv=3, refit=True, return_train_score=True
)

# Fit to the training data
random_rf_class.fit(X_train, y_train)

# Print the values used for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])
/Users/kakamana/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:424: FutureWarning: `max_features='auto'` has been deprecated in 1.1 and will be removed in 1.3. To keep the past behaviour, explicitly set `max_features='sqrt'` or remove this parameter as it is also the default value for RandomForestClassifiers and ExtraTreesClassifiers.
  warn(
[23 15 23 5 8]
['sqrt' 'sqrt' 'auto' 'sqrt' 'sqrt']
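Each run of the search above draws a different set of combinations because no random_state is set. As a hedged sketch, Scikit Learn's ParameterSampler (the class RandomizedSearchCV uses to draw its candidates) can preview which combinations a given seed would produce without fitting any model. The seed value 42 is an arbitrary assumption, and 'log2' is swapped in for the deprecated 'auto' so the example also works on recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterSampler

# 'log2' replaces the deprecated 'auto' (assumption for this illustration)
param_grid = {'max_depth': list(range(5, 26)), 'max_features': ['sqrt', 'log2']}

# Draw 5 combinations; a fixed random_state makes the draw repeatable
sampler = ParameterSampler(param_grid, n_iter=5, random_state=42)
samples = list(sampler)
for params in samples:
    print(params)

# Drawing again with the same seed gives the same combinations
repeat = list(ParameterSampler(param_grid, n_iter=5, random_state=42))
print(samples == repeat)
```

Passing the same random_state to RandomizedSearchCV should likewise make the sampled hyperparameters repeatable between runs.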